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It  is  shown  that  the  convergence  of  these  algorithms  can  be  accelerated  by  con¬ 
trolling  the  implementation  of  the  hnite  differences.  Particularly,  it  is  shown 
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optimization  for  a  broad  class  of  problems,  in  the  iteration  number  n. 
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1.  Introduction.  Let  i?  denote  the  set  of  real  numbers.  Consider  a  real- valued  function 
J{d)  of  the  form  J{d)  =  Ex[L{X{d)y\  where  0  is  a  parameter,  or  a  vector  of  parameters, 
L{X)  is  a  real- valued  function,  and  X{d)  is  a  random  variable  that  depends  on  9.  For 
simplicity,  throughout  this  paper  we  assume  that  0  is  a  scalar  and  6  &  Q  G  R,  and  X{d) 
is  of  the  form  X{6)  =  X{6,^),  where  is  a  random  variable  independent  of  6.  In  such  a 
formulation,  X (9)  is  parameterized  on  an  underlying  probability  space  that  is  independent  of 
9.  For  any  two  random  variables  rj  and  there  exists  a  Borel  function  0  such  that  ?]  = 
[Shiryayev  (1984),  p.l72].  Such  a  representation  for  X{9,^)  is  always  possible.  Therefore, 
J{9)  can  be  written  as  J{9)  =  E^[L{X{9,^))].  We  are  particularly  interested  in  hnding  an 
optimal  parameter  6**  G  0  to  optimize,  say  minimize,  J{9).  This  is  a  challenging  problem 
since  the  analytical  form  of  J{9)  is  usually  unavailable  for  most  problems  of  interest.  What 
is  obtainable  are  the  sample  measurements  of  the  random  value  of  L{X{9,^)).  We  have  to 
use  the  information  on  L{X{9,^))  to  find  9*.  Such  stochastic  optimization  problems  can 
be  found  in  many  applications.  The  main  approach  to  hnding  the  optimal  solution  is  to 
successively  approximate  9*  via  algorithms  of  stochastic  approximation.  This  is  a  classical 
and  standard  approach  that  has  been  adopted  in  practice  for  decades.  The  Robbins-Monro 
(RM)  algorithm,  the  Kiefer- Wolf owitz  (KW)  algorithm,  and  the  relatively  recent  mirror 
descent  (MD)  algorithm  are  the  most  popular  algorithms  of  this  class. 

The  RM  algorithm,  introduced  by  Robbins  and  Monro  (1951),  hnds  9*  in  the  following 
way.  Let  9o  be  selected  and  {a„}  a  sequence  of  positive  numbers.  For  each  integer  n  >  0,  let 

(1)  ^n+l  9ji  OjjiQn 

where  pn  is  an  unbiased  estimate  of  the  derivative  J'{9)  of  J{9)  with  respect  to  9.  Assume 
that  J'{9)  exists  on  0  and  the  variance  of  Pn  is  uniformly  bounded  for  all  n.  Assume 
{9-9*)J'{9)  >0for  all^^r, 

J2an  =  oo,  <  cx), 

n  n 

and  that  several  other  technical  conditions  are  satished.  Then  {9n}  converges  to  9*  with 
probability  one.  The  convergence  rate  (in  terms  of  root  mean  square  error)  is  .  Note 
that  this  is  the  best  possible  rate  of  convergence  for  algorithms  of  the  form  (1)  for  stochastic 
optimization  [see,  e.g.,  Fabian  (1971)]. 

The  KW  algorithm,  introduced  by  Kiefer  and  Wolfowitz  (1952),  is  a  modihcation  of  the 
RM  algorithm  by  approximating  the  gradient  using  a  finite  difference  and  hnds  9*  recursively 


1 


by 

(2)  ^n+l  ~  Ojnhni 


where 

(3) 


hn 


L{X{9n  +  (5n,  6,n))  -  L{X{9n  -  Sn,  6,n)) 

26n 


{(5„}  is  a  sequence  of  positive  numbers,  L{X{9n  +  5n,^i,n))  and  L{X{9n  —  ^n,&,n))  are  two 
measurements  of  L{X{9,^))  at  9n  +  and  9n  —  Sn,  and  ^i,n,^2,n  are  corresponding  samples 
of  Kiefer-Wolfowitz  (1952)  proved  that  if  J{9)  is  decreasing  for  9  <  9*  and  increasing  for 
9  >  9*,  and  if 


-t  0,  ^an  =  00,  '^anSn<00,  <CX), 

n  n  n 

the  sequence  {9n}  converges  to  9*  with  probability  one  under  some  additional  minor  condi¬ 
tions.  If  all  entries  in  are  mutually  independent,  the  best  possible  convergence  rate  for 

the  KW  algorithm  (2)  is  which  is  achieved  by  choosing  a„  =  an~^,5n  =  with 

a,d>0  constants  [e.g.  Burkholder  (1956);  Fabian  (1971);  Sacks  (1958)].  The  rate  is 
regarded  not  satisfactory  compared  to  the  best  possible  rate  for  the  RM  algorithm. 

The  MD  algorithm,  introduced  by  Nemirovski  and  Yudin  (1983),  improves  the  robustness 
of  gradient  based  optimization  algorithms.  At  iteration  n  >  0,  9n+i  is  updated  via  solving 


(4) 


9n+i  =  argmin^ge  i<hn,9  >  +—D^{9,  6'„)| , 


where  hn  is  an  estimate  of  the  derivative  J'{9),  H  :  0  x  0  — )■  R'^  is  a  Bregman  distance 
dehned  as 

(5)  D{9,t)  :=  iIj{9)  —  t/’(r)—  <  iIj'{t),9  —  r  >>  k\\9  —  r||^. 


where  is  a  distance  generating  function  and  k  >  0  is  a  constant.  In  (5),  ||.||  is  a  general 
norm  on  R™  (and  on  R  in  this  paper).  It  has  been  established  by  Nemirovski  et  ah  (2009) 
and  Duchi  et  ah  (2012,2013)  that  if  J{9)  is  convex,  Lipschitz  continuous  and 


dn  ^  0,  ^  ]  dn  OC, 

n 

1  ^  ^  ^  a- 

9n  =  =  - A  =  1,2,... 

then  J{9n)  converges  to  the  minimum  of  J{9)  and  the  rate  of  convergence  is  under 

mild  technical  conditions  that  will  be  specihed  in  Section  3. 

The  convergence  of  these  algorithm  is  fairly  understood  [Burkholder  (1956);  Fabian 
(1971);  Kushner  and  Clark  (1978);  Chung  (1954);  Dupac  (1957);  Dvoretzky  (1956);  Sacks 
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(1958);  Wasan  (1969),  Nemirovski  et  al.  (2009),  Duchi  et  al.  (2012,  2013)].  The  conditions 
for  the  convergence  of  these  algorithms  can  be  made  substantially  weaker  than  those  we  have 
previously  mentioned  [e.g.  Kushner  and  Clark  (1978);  Wasan  (1969)].  The  convergence  rate 
for  the  RM  algorithm  is  much  faster  than  that  for  the  KW  algorithm.  This  is  not  surprising 
if  we  note  that  the  KW  algorithm  uses  the  hnite  difference  hn  as  an  approximation  to  the 
derivative  J'{0),  while  the  RM  algorithm  uses  an  unbiased  estimate  of  J'{9).  Therefore,  the 
faster  rate  is  achieved  at  the  cost  of  obtaining  an  unbiased  estimate  of  the  derivative  J’{0) 
that  is  often  challenging  in  practice.  On  the  other  hand,  although  its  convergence  is  slower, 
the  KW  algorithm  requires  no  detailed  information  on  the  function  J{d).  It  is  simple  to  use 
and  applicable  to  a  wide  range  of  problems.  Kesten  (1958)  suggests  that  the  stepsize  be 
chosen  according  to  the  fluctuation  in  the  signs  of  Qn  and  hn-  A  few  of  other  techniques 
for  the  acceleration  of  stochastic  approximation  algorithms  can  be  found  in  Wasan  (1986). 
None  of  these  accelerating  techniques  can  improve  the  rate  of  convergence  of  the  algorithms 
under  study. 

In  this  paper,  we  are  interested  in  the  acceleration  of  the  KW  algorithm  and  the  MD 
algorithm  through  controlling  the  estimation  of  the  derivative  using  hnite  differences.  Fur¬ 
thermore,  we  consider  the  employment  of  the  scheme  of  common  random  numbers  (CRN) 
for  improving  the  convergence  of  the  algorithms  —  that  is,  the  random  factors  and 
^2,n  are  chosen  in  such  a  manner  that  =  ^2,n  =  ^n-  Implementation  of  CRN  in  Monte 
Carlo  optimization  is  rather  straightforward.  The  term  “Monte  Carlo  optimization”  is  used 
here  to  refer  to  the  procedure  of  hnding  the  optimal  solutions  through  computer  simulation 
where  the  random  factors,  represented  through  a  sequence  of  psuedo-random  numbers,  can 
be  controlled  [see  Bratley  et  al.  (1983)].  Computer  simulation  is  often  necessary  when  the 
form  of  L{X{9,^))  is  too  complicated.  This  is  the  case  when  L{X{9,^))  represents  a  per¬ 
formance  measure  of  a  stochastic  system  such  as  queueing  systems,  manufacturing  systems, 
transportation  systems,  and  communications  networks  [see,  e.g.  Ho  and  Cao  (1991);  Bratley 
et  al.  (1983);  Law  and  Kelton  (1982)].  In  Section  6,  we  will  give  an  example  where  the 
scheme  of  common  random  numbers  is  feasible. 

We  use  the  term  CRN  in  a  much  narrow  sense.  The  term  CRN  has  more  general  and 
sometimes  ill-posed  meaning  than  we  intend  in  this  paper  [see  Glasserman  and  Yao  (1992)]. 
In  this  paper,  CRN  simply  refers  to  that  simulation  experiments  be  performed  with  the 
same  stream  of  random  numbers.  As  far  as  the  KW  algorithm  (2)  or  the  MD  algorithm 
(4)  is  concerned,  CRN  requires  that  estimates  of  -JiJ)  +  5)  and  .J{9  —  5)  be  obtained  from 
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simulation  experiments  using  the  same  stream  of  random  numbers  {■Cn}-  Let  F{6,x)  denote 
the  distribution  function  of  X{6,^).  Then  any  experiments  with  hn  constructed  in  the 
following  form  conform  the  CRN  requirement: 

,,,  -  L(Y2(0n,S„,(„)) 

- 

where  the  marginal  distributions  of  Yi{9n,Sn,^n)  and  Y2{9n,Sn,^n)  are  F{9n  +  Sn,x)  and 
F{9n—Sn,  x),  respectively.  Note  that  the  joint  distribution  of  Yi(6*n,  Sn,  Cn)  and  Y2{9n,  Sn,  Cn)  is 
left  open,  which  may  be  used  to  improve  the  estimation  variance.  For  a  distribution  function 
F{9,x),  its  inverse  function  is  dehned  as  F~^{9,x)  infju  |  F{9,u)  >  x}.  Cambanis 
and  Simons  (1976)  and  Whitt  (1976)  proved  that  the  variance  of  (6)  is  minimized  when 
Yi(^9n,  Snj  ^n)  F  (9n  T  ^ni^n)  and  h^(^n)  C»i)  Y  i9n  fa.  this  paper  we 

assume  that  the  form  of  L{X{9,^))  is  given  and  hxed.  The  term  CRN  merely  refers  to  the 
special  choice  of  =  ^2,n-  We  will  show  that  the  use  of  CRN  can  signihcantly  increase 
the  rate  of  convergence  for  the  KW  algorithm  (2)  or  the  MD  algorithm  (4)  from  to 

at  least  For  a  large  class  of  functions,  the  rate  can  be  increased  to  the  best 

possible  rate  for  stochastic  approximation  algorithms.  CRN  increases  the  rate  of  convergence 
by  reducing  the  variance  of  hn-  Let  Rar[X]  denote  the  mathematical  variance  of  a  random 
variable  X.  Assume  that  Var[L{X{9,^)y\  is  continuous  in  9,  is  bounded  from  below  by  a 
positive  constant  and  from  above  by  a  constant.  Then  if  and  ^2,n  are  independent,  the 
variance  of  hn  is 

{Var[L{X{9n  +  5„,6,n))]  +  Var[L{X{9n  -  5.,  6,n))])/(2<5n)'  =  0{l/5l) 

which  grows  quadratically  as  goes  to  zero.  We  say  a  variable  /(s)  =  0(s)  if  |/(<s)/s|  < 
C,  C  >  0  is  a  constant  independent  of  s  (/(s)  =  o(s)  if  lim|/(s)/s|  =  0  when  s  goes  to 
zero  or  inhnity  depending  on  the  context).  It  is  such  a  large  variance  of  hn  that  slows 
down  the  convergence  rate  since,  when  5n  is  suitably  chosen,  the  rate  would  be  if  the 

variance  of  hn  is  bounded.  As  we  will  show  later,  the  convergence  rate  for  (2)  depends  on 
how  fast  the  variance  of  hn  goes  to  inhnity.  The  slower  the  variance  goes  to  inhnity,  the 
faster  the  convergence  rate  for  (2)  is.  CRN  has  been  observed  ehective  for  variance  reduction 
for  decades.  It  is  perhaps  the  most  popular  method  for  variance  reduction  [Bratley  et  ah 
(1983);  Conway  (1963);  Fishman  (1974);  Hammersley  and  Handscomb  (1964);  Heikes  et  ah 
(1976);  Kleijnen  (1974);  Law  and  Kelton  (1982)]. 

The  rest  of  the  paper  is  arranged  as  follows:  In  Section  2  we  examine  the  rates  of 
convergence  for  the  KW  algorithm  under  a  very  general  setting  that  covers  many  interesting 
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situations,  the  analysis  is  extended  to  the  MD  algorithms  in  Section  4.  In  Section  4  we 
show  that  the  use  of  CRN  can  reduce  the  variance  of  by  orders  of  magnitude,  which  in 
tnrn  accelerates  the  convergence  of  the  KW  algorithm.  In  Section  5,  we  examine  the  rate 
of  convergence  for  the  MD  algorithm  nnder  CRN.  In  Section  6,  we  extend  the  results  to 
multivariates.  A  practical  example  is  given  to  illustrate  the  feasibility  of  applying  CRN  in 
practice.  Finally,  a  snmmary  is  provided  in  Section  7. 

2.  Rates  of  convergence  for  the  KW  algorithm.  In  this  section,  we  examine  the 
rates  of  convergence  for  the  KW  algorithm  (2)  under  general  assnmptions  on  h„.  We  do  not 
assnme  that  hn  is  of  the  form  (3).  We  will  see  later  that  snch  a  treatment  covers  several 
important  cases. 

Assnme  that  <5^  >  0  goes  to  zero  as  n  — )■  cx)  and,  for  n  >  no  >  1,  satisfies  the  following 
assnmptions: 

(7)  E[K\dn]  =  J'{dn)  +  A„,  |AJ  <  , 

and 

(8)  Var[hn\0n]  <  c6l, 

where  b,c,l3  are  real  nonnegative  nnmbers,  7  G  R.  The  form  of  (7)  assures  that  hn  is  an 
asymptotically  unbiased  estimate  of  J'{9)  when  /d  >  0.  When  7  >  0,  the  variance  of  the 
estimate  goes  to  zero  as  n  — )■  cx).  This  is  generally  impossible  in  practice.  When  7  =  0  such 
as  in  the  RM  algorithm,  the  variance  is  bounded.  In  the  case  that  7  <  0,  e.g.  7  =  —2  if 
hn  is  defined  by  (3)  and  if  and  .^2,n  are  independent,  the  variance  of  hn  goes  to  inhnity. 
Next  we  examine  the  convergence  and  the  rate  of  convergence  for  the  KW  algorithm  (2). 
The  commonly  used  criterion  for  measuring  the  convergence  of  a  stochastic  seqnence  {On} 
is  the  root  mean  square  error  (RMSE)  dehned  as 

RMSE0„  =  {E[{On-9*f]Y/\ 

If  RMSEg^  =  s  >  0,  we  say  that  {6*„}  converges  at  the  rate  of  n~^  or  the  convergence 

rate  for  {9n}  is 

We  need  the  next  lemma  that  was  due  to  Chung  (1954)  and  was  formulated  in  the  present 
form  by  Fabian  (1971). 

Lemma  1.  Let  s,  t,  B,  An,  bn  he  real  numbers,  0  <  s  <  1,  t  >  0,  B  >  0.  Define  b+  =  0  if 
s  <  1  and  b+  =  t  if  s  =  1  and  assume  that  c  =  lim„_,.oo  An  —  b+  exists  and  is  finite.  If  for 
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n>  no 


B 


hn+l  <hn{l 


A 


n 


)  + 


n 


s-\-t 


and  if  c>  0,  then 

lim  supn*6n  <  B/c. 

n^oo 

The  statement  remains  valid  if  all  the  inequalities  are  reversed  and  lim  sup  is  replaced  by 

lim  inf. 

The  following  Theorems  1  and  2  give  the  convergence  rate  for  the  KW  algorithm  (2)  with 
hn  satisfying  (7)-(8): 

Theorem  1.  Assume  that  {6*„}  is  determined  by  (2)  and 
(A1 ).  an  =  an~°',  Sn  =  dn~^ ,  0  <  a  <  1,  ?]  >  0,  a,  d  >  0; 

(A2).  J{6)  is  increasing  for  6  <  6*  and  decreasing  for  9  >  9*,  and  there  exist  two  constants 
Ki,  K2,  0  <  Ki  <  K2  <  00,  such  that  for  all  0  G  0, 


Ki\9-9*\  <  \  J\9)\  <  K2\9-9* 


(A3),  conditioned  on  9n,  hn  at  the  nth  iteration  is  independent  of  those  at  the  other  iterations. 
Then,  if  a  =  (1/2)  minjo;  +  777,  2(3rj}  and  0  <  a  <  aKi,  we  have 

(9)  lim  svcpn^"^ E[{9n  —  9*Y]  <  C 

where  C  >  is  a  constant.  The  convergence  rate  for  RMSEg^  is  at  least 
Proof.  Without  loss  of  generality,  we  assume  that  9*  =  0.  Then 

=  E\el]-2a„E[e„K]+alE\hl] 

=  E[el]  -  2o„B[9„( J'(9„)  +  A„)]  +  alHE[hJf  +  Var[h„]). 

According  to  (7)-(8),  we  have 

(10) B[»7J  <  E[el]  -  2a„i5|»„J'(«„)]  +  2ta„yB[|«„||  +  2al(E[J' (OVi?  +  hHf)  +  calSl 

By  Assumption  (A2),  9nJ'{9n)  >  0  and 

(11)  0nJ'{9n)  >  K,9l  {J'{9n)f  <  K^- 
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Furthermore,  for  any  e„  >  0,  \0,i\  <  and  consequently 


b[|»„|]<£„  +  -bK]. 


By  setting  0  <  e  <  1  and 


26(5£ 


— 


Xie’ 


we  have 

(12) 

Substituting  (11)  and  (12)  into  (10),  we  obtain 


46= 


(13)  <  J!K]|1  -  (2  -  e)A'iO„  +  2Klal]  +  2i\lSf  +  ■ 

According  to  (13),  also  noting  Assumption  (Al),  we  can  choose  an  ni  >  no  >  1  such  that 
for  all  n  >  ni 


Ar 


B 


where 


—  (2  —  e)aKi  — 


2iy2  2 


n“  A'le 

If  aKi  >  a,  we  can  always  choose  e  >  0  so  small  that  (2  —  e)aKi  >  2a.  Applying  Lemma  1, 
we  obtain  (9)  with  C  =  B/{{2  —  e)aiLi)  if  o;  <  1  and  C  =  B /{{2  —  e)aKi  —  2a)  if  a  =  1.  I 


It  follows  directly  from  Theorem  1  that  {On}  converges  to  0*  as  long  as  cr  >  0,  or 
equivalently,  as  long  as  a  +  777  >  0.  When  a  +  77  <  0  which  is  possible  only  when  7  <  0, 
the  variance  of  grows  to  inhnity  at  the  rate  of  n*  with  t  =  —'yrj  >  a.  It  is  obvious  from 
(2)  that  {On}  does  not  converge.  Another  extreme  case  is  that  7  >  0.  In  this  case,  a  can  be 
made  arbitrarily  large  by  choosing  appropriate  77.  The  convergence  rate  for  {On}  can  be  made 
arbitrarily  large  if  77  can  take  any  value.  In  fact,  by  setting  77  — )■  cx)  in  (13)  (or  equivalently, 
t  0)  and  ttn  =  a  such  that  0<g  =  l  —  (2  —  e)K2a  +  2X^0?  <  1  for  sufficiently  large  tt, 
we  have 

^[Ci]  <  <iE[0l]. 

The  convergence  rate  for  the  sequence  {On}  is  that  of  a  geometric  progression.  Unfortunately, 
this  is  a  very  special  case.  One  should  not  expect  7  >  0  in  practice.  Both  of  the  situations 
7  >  0  and  q;  +  777  <  0  are  too  special  to  deserve  further  study.  The  most  interesting  case  is 
when  7  satishes  —a/r]  <  7  <  0. 
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Theorem  1  shows  that,  when  hn  satisfies  (7)-(8),  converges  with  probability  one  to 
the  optimal  parameter  9*  at  a  rate  of  at  least  n~" .  We  can  further  prove  that  {On}  converges 
exactly  at  this  rate  as  interpreted  in  the  following  Theorem  2. 

Theorem  2.  Assume  that  Assumptions  (A1)-(A3)  are  satisfied. 

1.  //a  +  7?7  <  2137],  aK2  >  a,  EfifiOn]  =  +  A„,  |A„|  <  and  V^ar[h„|6'„]  >  cS{[, 

we  have 

infn2"E[(0n  -  e*f]  >  Ci 

where  Ci  >  is  a  constant. 

2.  If  a  +  'jf]  >  2/3r],  EfifiOn]  =  J'{0n)  +  bS^{l  +  £„),  and  J'(6'„)  =  {On  -  0*){K3  +  r^), 
En  =  o(l)  and  Tn  =  o(l)  uniformly  as  n  ^  oo,  =  .J''{0*)  >  0,  a  <  aK^,,  then 

lim  sup  —  6*]  <  —C2 

where  C2  >  is  a  constant. 

Proof.  Let  consider  the  first  statement.  For  simplicity  and  without  loss  of  generality, 
we  assume  6*  =  0.  Parallel  to  the  derivation  of  (10)  we  have 

E[el^,]  =  E[el]  -  2anE[en{J'{0n)  +  A„)]  +  al{{E[hn]f  +  Var[hn]} 

which  implies 

E[el^,]  >  E[el]  -  2anE[enJ'ien)]  -  2ban6^E[\en\]  +  caldl 

Assumption  (A2)  implies  that  0  <  OnJ'ifin)  <  K20n  which,  together  with  (12)  where  Ki  is 
replaced  by  K2,  shows  that 

452 

(14)  £[97i1  >  E[el]{l  -  (2  -  0K2a„)  +  ca^l  - 

If  a  +  77  <  2/3rj,  there  exists  an  no  >  1  such  that  when  n  >  no 

a  a2/3  > 

Therefore,  we  know  from  (14)  that  when  n  >  Uq 


Since  aK2  >  cr,  we  can  always  choose  e  >  0  so  small  that  (2  —  e)aK2  >  2(7.  The  first 
statement  of  the  theorem  follows  from  applying  Lemma  1  with  Ci  =  ca?‘(P /{2{2  —  e)aK2)  if 
a  <  1  and  Ci  =  ca^d^ /{2{2  —  e)aK2  —  da)  if  a  =  1. 

If  q;  +  7?7  >  2(3ri,  we  know  that  a  =  I3r]  >  0  and 

=  E[e^]-anE[f{e^)]-baX{l  +  en) 

(15)  =  E[9n\{l  -  Ksttn)  +  anE[TnOn\  -  bttnS^il  +  E[en]) 


Define  Zn  =  n^E[9n].  Then  (15)  shows  that 

Zn+l  =  [E[9n\{^  —  Ksan)  +  (lnE[Tn9n]  —  bttnb^i^  +  £n)] 

=  ^n[l  H - E^ttn - K^ttn  +  0{—)]  +  (1  H — ^ {anE[Tn9n\  —  (1  +  E[en\))- 

n  n  n 

Denote 

C7^  LJ  1 

An  =  1  H - K^ttn - K^ttn  +  0{  —  ), 

n  n 

Bn  =  {l  +  -Yn^{ban5i{l  +  E[en])  -  a„^[r„0„]). 
n 

Then 

(16)  Zn+l  —  AnZn  Bn- 

Note  that  a„  =  an““,0  <  a  <  l,6n  =  dn~'^,aK^  >  a.  We  may  choose  Ai,A2  >  0,ni  >  1 
such  that,  for  all  n  >  ni, 


(17) 


Ai  ,  a 

o<i-4<di,,  =  i  +  - 
n“  n 


aKs 

n“ 


A2 

n°‘ 


Since  Assumptions  (Al)-(A3)  in  Theorem  1  are  satisfied,  lim^^oo  sup  <  C  which 

implies  that 

lim  supn°'|A'[6'„]|  <  lim  snpiri^'' EW^V^'^  <  \fC. 

n^oo  I  L  JI  n^oQ  ^  L  /tj/ 

According  to  the  assumptions  that  =  o(l),  =  o(l)  uniformly  as  n  — )■  cx),  and  5^  = 

d^n~^ .  There  exists  an  n2  >  1  such  that,  when  n  >  n2, 

Bn  =  (1  +  ^Yn'^ibanS^il  +  E[en])  -  anE[r„6'„]) 

>  (1  +  ^YnYbanS^nY  +  E[en])  -  anE[\Tn\]E[\9n\]) 

(18)  >  (1  +  ^Yn'^^band^  >  ]^rfban5Y 
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Let  no  =  max{ni,n2}.  Then,  from  (16)  we  know  that  for  all  n  >  no 

n  n—1  n 

(19) 


^no  \  ^  ^  Aj 

i=no  i=no  j=2+l 


Since  0  <  a  <  1,  (17)  shows  that 

n  n 

(20)  0  <  lim  n  <  lim  IT  (1  “  =  0- 


^2^ 


^=?^o 


i=no 


Furthermore,  lim^^oo  =  0,  and  (17)  and  (18)  imply  that 


n—1 


^  nhd^  nhd!^  abd^^  A-, 

(21)  EB.  n  n  Ar  >  E(l-  ^)-- 


.  ^  -  2n“  ”  “  2n“  . 

z=no  j=t-hl  z=no  j=i-\-\  2=no  2=no 


On  the  other  hand, 


1 

lim  —  y  (1  -  — 

n— >-00  79<^ 

i=no 


1,  if  0  <  a  <  1 

1  —  if  a  =  1- 


Substituting  the  preceding  inequality,  (20)  and  (21)  into  (19),  we  see  that 


lim  sup  Zn  <  —  O2 

n^oo 

with  0*2  =  (l/2)a6(i^(l  —  e~^^)  >  0.  This  is  exactly  what  we  want  to  prove.  I 

Theorems  1  and  2  show  that  the  convergence  rate  for  {On}  is  generally  n~‘^.  If  we  are 
free  to  choose  the  positive  numbers  a,  rj,  it  follows  directly  from  Theorem  1  that 

Corollary  1.  Assume  that  hn  satisfies  (7)-(8)  and  7  <  0.  Under  Assumptions  (A2)- 
(A3)  in  Theorem  1,  the  best  possible  convergence  rate  for  the  KW  algorithm  (2)  is 
which  is  achieved  by  setting  a  =  1,  p  =  l/(2/3  —  7),  and  by  choosing  appropriate  a,d  >  0. 

For  the  KW  algorithm  (2)  with  dehned  by  (3),  assume  that  J{9)  is  continuously 
differentiable  of  order  up  to  three  and  the  third  order  derivative  J"'{0)  is  uniformly  bounded 
on  0,  we  have 

B1A„|«„|  =  ~  =  J'(f)n)  +  =  J'(«n)  +  0(C) 

ZOn  6 

where  On  G  [On  —  8n,  On  +  5n]-  In  Ibis  case,  {3  =  2.  If  the  assumptions  (8)  and  (Al)-(A3)  are 
satished  and  if  the  positive  number  a  is  chosen  sufficiently  large,  we  know  from  Theorems  1 
and  2  that 

1 

(22)  (T  =  -  min{Q;  +  77, 4?7}. 
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If  we  use  the  one-sided  finite-difference  approximation  in  (2): 

(23)  ^  L{X{9n  +  5n,^l,n))  -  L{X{9^,^2,n)) 

S2,n 

and  if  J{9)  is  twice  continuously  differentiable  and  the  second  order  derivative  J"{9)  is 
bounded  on  0,  then  for  any  6n  there  exists  a  9n  &  [9n,  9n  +  Sn]  such  that 

E[h^\9]  =  -  J{0n)  ^  ^  ]-j"{9n)5n  =  /(^n)  +  0{5n). 

Therefore,  /3  =  1.  Under  the  same  conditions  as  those  in  the  previous  case  we  know  that 

(24)  (T  =  -  min{Q; -|- 7?],  27]}. 

It  is  clear  from  (22)  and  (24)  that,  under  the  same  condition  for  the  variance  Uar[h„|6'„], 
the  convergence  rate  of  the  KW  algorithm  is  faster  when  symmetric  differences  are  used 
than  that  when  one-sided  differences  are  used.  Corollary  1  shows  that  the  best  possible 
convergence  rate  depends  on  two  factors — how  fast  the  bias  decreases  to  zero  and  how  slow 
the  variance  increases  to  inhnity.  Using  symmetric  hnite  difference  (3)  instead  of  the  one¬ 
sided  hnite  difference  (23)  can  reduce  the  bias  of  hn-  To  summarize,  we  have  the  following 
conclusion  which  will  be  used  later. 

Corollary  2.  Suppose  that  (A1)-(A3)  are  satisfied.  If 

(A4).  J{9)  is  continuously  differentiable  of  order  up  to  three  and  the  third  order  derivative 
.J"'{9)  is  bounded  on  0, 

then  the  best  possible  convergence  rate  for  the  KW  algorithm  (2)  with  hn  defined  in  (3)  is 
n-2/(4-U,  If 

(A5).  .I{9)  is  twice  continuously  differentiable  and  the  second  order  derivative  J"  {9)  is  bounded 
on  0, 

then  the  best  possible  convergence  rate  for  the  KW  algorithm  (2)  with  hn  defined  in  (23)  is 

3.  Rates  of  convergence  for  the  MD  algorithm.  The  rate  of  convergence  of  the 
MD  algorithms  was  established  by  Nemivoski  et  ah  (2009)  when  the  hn  in  (4)  is  an  unbiased 
estimate  of  the  derivative,  and  by  Duchi  et  ah  (2012,  2013)  when  the  hn  is  approximated  by 
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the  one-sided  finite  difference  (23).  In  this  section,  we  examine  the  rate  of  convergence  of  the 
MD  algorithm  for  general  hn-  Again,  we  only  assume  that  satisfies  (7)- (8).  For  notational 
consistence,  the  norm  ||.||  in  (4)  is  taken  as  the  I2  norm.  Its  dual  norm  ||a;||*  :=  sup||j^ll<;^  y'^x 
is  also  the  /2  norm.  Define 


d 


n 


1 

n 


We  next  examine  the  convergence  of  J{0„ 


Theorem  3.  Assume  that  {9n}  is  determined  by  (4),  and 


(Bl).  'ipiO)  is  strongly  convex,  0  is  compact  and  convex,  and  there  exists  r  >  0  such  that 
D{e%e)  <  {l/2y,r  >  O  for  aliee  0; 


(B2).  L{X)  is  closed  convex,  and  there  exist  two  constants  Ki,K2,0  <  Ki  <  K2  <  00,  such 
that  for  all  6  G  Q, 


Ki\e-e*\  <  \  j\e)\  <  K2\e-e* 


(B3).  conditioned  on  9^,  at  the  nth  iteration  is  independent  of  those  at  the  other  iterations. 
Then 


yi  yi  n  yi  n  yi  n  yi  n 

(25)  Eim.)  -  J(r )]  <  El  +  , 


nar,  n 


i=l 


^  r=i 


n  T=i 


^  r=i 


where 


r2  c  Ky  r  r  - 


Proof.  Under  Assumptions  (B1)-(B3),  we  know  from  Duchi  et  al.  (2013),  eqn.  (13), 
that 

2  A  A 

(26)  j(»„)  -  j(r )  <  - —  +  E  “.A?  -  -  E  A.(».  -  »•). 

2na„  inn 

Therefore, 

2  ^  ^  1  ” 

(27)  E[j(k)  -  J(r)|  <  ^  ^  E  “.-EKl  +  -  E  ^11  A.(».  -  »')ll' 

The  assumptions  (7)- (8)  give 

(28)  E[hl]  =  Var[hf\  +  {E[J\9i)  +  A,])^  <  c5]  +  2{E[.J\9i)]Y  + 
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On  the  other  hand,  according  to  Assumptions  (B1)-(B2), 


(29) 


<  (I<2E\\e,  -  9-|])"  < 


2k 


and 

(30) 


E[\^,\\ei-e*\]  < 


b5^r 


By  combining  (28)-(30)  with  (27),  we  obtain 

.2  ^  n  ” 


A  ,25  ,  br  " 


2nK?  UK  n^fS.  (zt 


which  is  exactly  (25).  I 


A  special  case  of  interest  is  7  =  0,  which  corresponds  to  bounded  variance  of  derivative 
estimation.  The  following  Corollary  3  provides  a  bound  on  the  convergence  of  the  MD 
algorithm  for  this  case  with  properly  chosen  {ttn},  {5n}- 


Corollary  3.  Suppose  that  (A4),  (B1)-(B3)  are  satisfied.  Let  7  =  0,  , 

dn  =  dn~^ ,  a  >  0,d  >  0. 

(a)  For  hn  defined  in  (23), 

(31)  E[J{en)  -  J(r)]  <  (Cl  +  2^2  +  +  (2.5^4^=")“  +  (Csd)^^^. 

yjn  n  n 

(b)  For  hn  defined  in  (3), 

(32)  E[Jien)  -  J{e*)]  <  (Cl  +  2C2  +  +  (9C4dV7)-  +  i2C,d^)-. 

\/  Ti  Ti  n 


Proof.  For  7  =  0,  a„  =  5n  =  dn~^,  combining  the  first  three  terms  in  (25) 

gives  the  first  term  in  (31).  For  hn  dehned  by  the  one-sided  hnite  difference  (23),  under 
Assumption  (A4),  we  have  (3  =  1.  Consequently,  the  fourth  term  in  (25)  is 


n 


i=l 


C^ad^ 

n 


2=1 


< 


C4a(i^2.5 

n 


The  last  term  is 


n  ri 

AT  E  —  E  *  ^ 


C^d{l  +  logn) 


n 


n 


n 


i=l  i=l 

Combing  the  previous  two  inequalities  with  (25)  gives  (31). 
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For  hn  defined  by  the  symmetric  finite  difference  (3),  under  Assumption  (A4),  we  have 
/3  =  2.  Consequently,  the  fourth  term  in  (25)  is 


9± 

n 


n 


C^ad^ 

n 


n 


i=l 


< 


C'4ad^(9/7) 

n 


The  last  term  is 


Cs  ”  ^2  ^  2C,d? 


n 


n 


n 


i=l  i=l 

Combing  the  previous  two  inequalities  with  (25)  gives  (32).  I 


Duchi  et  ah  (2013)  investigated  the  convergence  of  the  MD  algorithm  using  the  one-sided 
hnite  difference  (23)  as  an  approximation  to  the  derivative.  The  bound  (31)  is  technically 
the  same  as  that  in  Duchi  et  ah  (2013).  When  the  symmetric  hnite  difference  (3)  is  used,  the 
logn  factor  disappears  in  the  last  term  of  (32),  which  indicates  that  the  symmetric  hnite- 
difference  approximation  (3)  leads  to  a  tighter  bound  under  similar  assumptions,  which  is 
due  to  that  the  symmetric  hnite  diherence  (3)  typically  provides  more  accurate  estimate  of 
the  mean  of  the  derivative  than  the  one-sided  ones  do.  Note  that  Duchi  et  ah  (2012,  2013) 
implicitly  assumes  that  CRN  is  used  in  calculating  the  hnite  diherence  (3)  or  (23)  that  will 
be  covered  in  Sections  4  and  5. 

It’s  worth  of  noting  that  the  rate  of  convergence  for  the  MD  algorithm,  as  given  by 
(25),  is  which  is  not  ahected  by  the  choice  of  hnite-diherence  approximation,  either 

symmetric  or  one-sided,  to  the  derivative.  This  is  by  design  since  the  MD  algorithm  was 
originally  proposed  for  improving  the  robustness  in  the  choice  of  stepsizes  at  the  cost  of 
slower  convergence. 


When  a  hnite  diherence  is  used  to  approximate  the  derivative,  it  is  desirable  to  have 
— )■  0  as  n  — )■  cx)  to  ensure  asymptotically  unbiased  estimate  of  the  derivative.  In  this  case, 
it  is  possible  (and  likely  in  practice!)  that  the  variance  of  the  estimate  goes  to  inhnity.  This 
is  a  special  case  of  (25)  with  7  <  0.  Therefore,  Theorem  3  allows  hexibility  to  cover  general 
cases. 

A  special  situation  is  when  L{X{9n  +  5n,^i,n))  and  L{X{9n  -  dn,^2,n))  or  L(X(9n,^2,n)) 
in  (3)  or  (23)  are  sampled  independently.  In  this  case,  7  =  —2.  Assume  further  that  {a„} 
and  {(5„}  are  specihed  as  in  Assumption  (Al).  Then  the  right  hand  side  of  (25)  becomes 


H(n) 


Cl  ,  ^2 


C 


nttr 


n 


i=l 


G 


+  ^  y  a.V  +  ^  i:  Oi  +  - 1:  a,e  + 


i=l 


n 


2=1 


n 
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where 


/^/^n  ^  n  /^n  ^  n 

=  +  —  E  ar  +  E  V  ai-^  +  —  E  +  —  E 

/7'nl~(^  'n  ^  'n  ■‘- — ^  'n  ■‘- — ^  'n  ■‘- — ^ 


an^ 


n 


n 


n 


n 


i=i  "'  i=i  "'  i=i  "'  i=i 

=  0(n-^+“)  +  0(n-“+2^)  +  0(n-“)  +  0{n-^-^^)  +  0{n-^^) 


=  0{n-^), 


a  =  min{l  —  a,  a  —  2ri,  a,a  +  2/3ri,  /3ri}  =  min{l  —  a,a  +  2ri,  Pr]}. 

For  the  one-sided  hnite  difference  (23),  /3  =  1.  Then 

a  =  minjl  —  a,a  +  2ri,  rj}  <  1/4. 

For  the  symmetric  hnite  difference  {3),  /3  =  2.  Then 

a  =  min{l  —  a,a  +  2ri,  2r]}  <  1/3. 

The  previous  discussion  can  be  summarized  in  the  following  Corollary  4. 

Corollary  4.  Assume  that  Assumptions  (Al),  (A4),  (B1)-(B3)  are  satisfied,  and  that 
L{X{6n  +  6n,^i,n))  and  L{X {6n  -  6n,  ^2,n))  in  (3)  (or  L{X{0n,(2,n))  in  (23))  are  independent. 
Then 

(i)  the  best  possible  rate  of  convergence  for  the  upper  bound  H{n)  is  when  the  one¬ 

sided  finite  difference  (23)  is  used, 

(a)  the  best  possible  rate  of  convergence  for  the  upper  bound  H{n)  is  when  the  sym¬ 

metric  finite  difference  (3)  is  used. 

Note  that  the  rates  of  convergence  are  only  upper  bounds  of  E[J{9n)].  Such  rates  of 
convergence  are  consistent  with  those  for  {dn}- 

4.  The  KW  algorithm  with  CRN.  In  this  section,  we  will  show  how  CRN  can 
accelerate  the  convergence  of  the  KW  algorithm.  For  clarity  and  without  getting  trapped 
into  unnecessary  tediousness  of  details,  we  focus  our  attention  on  the  case  in  which  f,  E  R  is 
a  real  one-dimensional  random  variable.  In  Monte  Carlo  optimization,  is  usually  a  psuedo- 
random  number  generated  by  a  computer.  For  most  applications,  such  a  pseudo-random 
number  is  sufficiently  good  to  be  regarded  as  a  random  number  uniformly  distributed  on 
[0, 1).  In  Section  6,  we  extend  the  results  to  general  situations. 
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To  avoid  repetition,  we  only  consider  the  hn  defined  as  in  (3)  with  =  ^2,n  =  Cn- 
The  analysis  is  applicable  to  the  one-sided  finite-difference  approximation  (23)  without  any 
difficult.  For  a  given  6n,  hn  is  a  finite-difference  approximation  to  the  derivative  J'{0)  at 
6  =  On-  For  simplicity,  we  omit  the  subscript  n.  Then 

(33)  ^  ^  L[X(e  +  s,!:,))  -  L(X[0  -  S,!:,))  ^ 

26 

The  mean  of  h  is 

26 

which  is  the  same  as  that  of  (3)  without  the  use  of  CRN.  However,  the  variance  of  (33),  as 
we  will  show,  is  generally  smaller  than  that  of  (3)  without  the  use  of  CRN  when  <5  >  0  is 
sufficiently  small.  We  will  also  show  that  the  reduction  in  the  variance  of  h  may  have  a  sig¬ 
nificant  impact  on  the  convergence  rate  for  the  KW  algorithm  for  Monte  Carlo  optimization. 
Toward  that  end,  we  need  to  specify  the  generation  of  the  random  variable  X{6,^)  with  a 
given  distribution  F{0,x).  Next  we  examine  the  variance  of  (33)  for  several  popular  random 
number  generation  methods.  Note  that 

Var\h]  =  j^^{El{L(X{0  +  i.O)  -  L(X(0  -  {)))"]  +  (AS  +  ^)  -  AS  - 

If  J{6)  is  continuously  differentiable  on  0  with  bounded  derivatives,  then 

(34)  Var\h]  =  ^B[(L(A'(»  +  i.O)  -  L(X(e  -  i,0))"l  +  0(1), 

4-1-  Inversion  method.  Inversion  is  one  of  the  most  popular  methods  for  random 
variable  generation.  Let  F{0,x)  be  the  distribution  function  of  X{6,^).  The  inversion 
method  generates  the  random  variable  X{6,^)  in  the  following  way: 

1.  Generate  a  random  number  ^  uniformly  distributed  on  [0, 1). 

2.  Setx{e,o  =  F-^{e,0- 

Then  it  is  straightforward  to  verify  that  X{6,^)  has  the  desired  distribution.  Note  that  the 
mapping  F{6,  x)  :  R  R  is  not  one  to  one  in  general.  To  ensure  its  existence  for  general 
distribution  functions,  the  inverse  function  is  defined  as 

F~^{9,^)  =  min{a;  |  F{9,x)  >  ^,  x  G  R} 
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which  is  different  from  the  usual  definition  [see  Krantz  (1991)].  It  coincides  with  the  usual 
definition  if  F{6,x)  is  continuous  and  strictly  increasing.  Such  a  definition  of  the  inverse 
function  covers  both  continuous  and  discrete  random  variables.  For  example,  consider  a 
discrete  random  variable  X{9,^)  =  Xi  with  probability  Pi{0).  Define  po{0)  =  0,pi{9)  = 
Yl]=iPj{9)  for  i  >  1.  Let  ^  be  uniformly  distributed  on  [0, 1).  The  inversion  method  gives 
=  2:*  if  G  [pi-i{9) ,  pi{9)) .  Then  direct  verification  shows  that  X{9,^)  obeys  the 
desired  distribution.  This  is  a  discrete  version  of  the  inversion  method. 

In  order  to  proceed  with  our  discussion,  let  us  first  examine  the  properties  of  distribu¬ 
tion  functions.  A  distribution  F{9,x)  is  a  nondecreasing  and  right-continuous  function  of 
X.  F{9,  x)  has  at  most  countably  many  points  of  discontinuity  on  R  and  all  of  the  dis¬ 
continuities  are  of  the  first  kind  —  that  is,  for  any  x  E  R,  F{9,x~)  =  limy^x  F{9,y)  and 
F{9,x^)  =  \imyi3;  F{9,y)  exist  and  are  finite  [e.g.  Krantz  (1991), 149-150].  Therefore,  we 
can  divide  R  into  \JiBi{9)  =  R,  where  Bi{9)  =  [bi{9),bi+i{9)),  such  that,  for  each  i,  F{9,x) 
is  continuous  on  Bi{9),  but  jumps  at  bi{9).  Assume  that,  for  each  i,  F{9,x)  is  piecewise 
differentiable  on  Bi{9).  Then  F^{9,x)  >  0  whenever  it  exists.  We  further  divide  the  interval 
Bi{9)  into  subintervals  according  to  whether  the  derivative  of  F{9,x)  with  respect  to  x  is 
zero  or  not.  For  simplicity,  we  assume  that  Bi{9)  =  B^{9)  U  Bf  (9)  such  that  F^(9,  x)  =  0  on 
B^(9)  =  lbi(9),  Ci(9)]  and  F(9,x)  =  Fi(9,x)  is  continuously  differentiable  with  strictly  posi¬ 
tive  derivatives  on  B^(9)  =  (ci(9),  bi+i(9)).  It  is  possible  that  bi(9)  =  Ci(9).  On  Bj^(9),  the 
derivatives  F^(9,x)  should  be  understood  as  the  right  and  the  left  derivatives  at  bi(9),Ci(9), 
respectively.  It  is  possible  that  F(9,x)  is  not  differentiable  at  Ci(9).  The  inverse  F~^(9,^) 
is  defined  in  the  usual  sense.  It  is  continuous,  strictly  increasing,  and  differentiable  on 
(F(9,cm,F(9,b-,,(9))). 

Under  the  preceding  decomposition,  F(9,x)  is  discontinuous  at  bi(9),  is  a  constant  on 
B^(9),  and  is  strictly  increasing  and  differentiable  on  B^(9). 

The  following  Lemma  2  follows  directly  from  the  definition  of  the  inverse  function  and 
the  decomposition  of  F(9,x). 

Lemma  2.  Let  X{9,^)  be  defined  by  the  inverse  funetion  X{9,^)  =  F~^{9,^).  Let 
Ei{9)  =  [F{9,bi{9)),F{9,b~^fi9))).  Then  Efid)  C  [0,1)  and  for  any  f  E  Efid) 

r  bfi9),  ^ffE[F{9,b-{9)),F{9,cfi9))), 

(35)  ^(^,0=  cfi9),  ^ff  =  F{9,cfi9)), 

[  F-\9,0,  ^f^E{F{9,cfi9)),F{9,b-^fi9))). 
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We  need  the  following  result. 

Lemma  3.  Assume  that 

(Cl).  L{X)  and  L'x{X)  are  bounded,  .J{6)  is  continuously  differentiable  on  0; 

(C2).  for  each  i,  FfO^x)  is  continuously  differentiable  on  Bf{9)  with  strictly  positive  deriva¬ 
tives  with  respect  to  x,  and 

J2E[{max{F'g{e,x)y/F'^{e,x))I^+^g-^]  <  oo; 

i 

(C3).  hi{6)  is  continuously  differentiable  in  9,  and  f2ima,X0{b'f9))^  <  oo] 

(Cf).  for  each  i,  the  functions  F{9,Ci{9))  and  F{9,b~{9))  are  continuously  differentiable  in 
9,  and  ffiT[iax0\F'{9,Ci{9))\  <  oo,  |F'(6*,  &“(6'))|  <  oo, 

Define  Mi{9)  =  2J2i{L{ci{9))  —  L{bi{9))fi\F'{9,Ci{9))\.  Then  Mi(9)  >  0  is  bounded  for  all 
9.  If  Mi{9)  >  0,  we  have 

(36)  E[{LiXi9  +  6, 0)  -  L{Xi9  -  6, 0))1  =  Mfi9)5  +  0(6) 

as  6  >  0  goes  to  zero. 

Proof.  We  calculate 

lim \E\(L{X(e  +  i.O)  -  L{X{0  -  i,0))=] 

=  Mm  t  X  E[(L(X(0  +  6X))-  L{X(e  -  6, 

i 

Let 

B.(9,  =  \e[(L(X(0  +  i.O)  -  L(X{0  - 

Then 

(37)  Mm  L[(L(X(»  +  3,0)  -  L(X(e  -  3,0))''l  =  Mmj:^.)^^). 

i 

Next,  we  prove  that  the  limit  and  the  summation  commute.  Dehne  Di  i  =  3^(9— S)  n[0,  E{9-\- 
6,  b-{9  +  6))),  A, 2  =  Efi9  -  5)  n  Efi9  +  5),  and  A, 3  =  Efi9  -  6)  n[E{9  +  6,  b-+fi9  +  6)),  1).  It 
is  possible  for  each  of  Dij,j  =  1,  2,  3,  to  be  empty.  Then  Ei{9  —  5)  =  U  Aj  and 

(38)  Rfi9,S)  =  E Ao,  R^,J  =  ^E[{L{X{9  +  S,0)-L{X{9-S,0)?IdJ. 

i=i  " 
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By  Assumption  (Cl),  there  exist  A'i,A"2  >  0  such  that  \L{X)\  <  Ni,  \L'x{X)\  <  N2. 
Therefore, 

R^^  <  {2N,f\F{e  +  s,b;{e  +  s))-F{e-s,b-{e-6))\/6 
<  2{2N,fm^x\F\e,b-{e))\, 

u 

n..3  <  (2Njf\F{0  +  s,b-^,(e  +  S))  -  F(e  -  s,b-^,{0  -  6))\/s 
<  2(2Afi)"mtK|F(»,()-+i(»))|. 

u 

Without  loss  of  generality,  assume  that  F{6  +  S,b~{6  +  5)))  >  F{6  —  S,b~{6  —  5)))  and 
F{9  +  S,  +  5)))  <  F{e  -  S,  -  5))).  If  F{9  +  S,  c,{9  +  S))  >  F{9  -  S,  c,{9  -  S)), 

I  rF{0-5,Ci{e-5)) 

(39)  R.,2  =  -J  {L{X{9  +  S,0)-L{X{9-6,0)M 

0  JF(e+s,b-{e+5)) 


1  .F(0+5,Ci(0+5)) 

5  J F{e-s,ci{e-s)) 

1  .F(0+5,fe-+i(0+<5)) 

6  J F{0+s,ci{e+5)) 


{L{X{9  +  5,0)-L{X{9-5,0)fd^ 


{L{X{9  +  6,0)-L{X{9-S,0)?d^ 


1  .F(0-5,Ci(0-5)) 


S  J F(e+s,b-  (e+s))) 

I  .F(0+5,Ci(0+5)) 
6  J F{e-5,ci{e-5)) 


{L{h{9  +  5))-L{h{9-5))fdi 


(L{h{9  +  6))-L{F-\9-5,0)Ydi 


1  .F{e+&,b-^^{e+&)) 


6  J F{e+&,ci{e+5)) 

The  first  two  terms  of  (39)  are  bounded  respectively  by 


{L{F-\9  +  5,i))-L{F-\9-5,i))Ydi 


ANI  m.sji.{b'Y9)Y d  and  2(2W)^  max  |F'(6*,  Cj(6*))|. 

6  6 

The  third  term  of  (39)  can  be  rewritten  as 
1  /•F{e,ci{e)) 

-  {L{F-\9  +  6,0)-L{F-\9-S,0)Yd^ 

0  J F(6-\-S^Ci[9-\-S)) 

1  .F{e+s,b-^^{e+s)) 

+F  I  "  {L{F-\9  +  6,0)-L{F-\9-6,0)Yd^ 

+  -.  I  ,,,  {L{F-\9  +  6,0)-L{F-\9-6,0)?d^ 

0  JF(e,ci(e)) 

<  2{2Nif  max  \F'{9,Ci{9))\  +  2{2Nif  max  |F'(0,  6“+i(0))| 

0  9 
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Therefore,  i?i,2  is  bounded  by 

4iV2  max{b[{6))^6  +  4(2iVi)^  max  \F'{6,  Ci{6))\ 

0  6 

+2(2Af,)"max|F(»,6r^,(«))|  +  4/V|B|(max(f:,(«,i))Vf;U».2^))V«-)l'^ 

Similarly,  we  can  prove  that  if  F{9  +  <5,  Ci{9  +  5))  <  F{9  —  6,  Ci{6  —  5)), 

(40)  i?,,2  =  -/  {L{k{e  +  6))-L{k{e-S))fd^ 

0  J F(e+5,b-  {e+5))) 

1  /•F{e-5,ci{e-5)) 

+  -  /  {L{F-\e  +  6,0)-m{e-s))M 

0  J F[9-\-S,Ci{9-\-S)) 

1  rF(9-\-S,b-(9-\-S))) 

+  -J ,  ^  {L{F-\e  +  5,i))-L{F-\e-5,i))fdi 

0  JF{9-S,Ci{9-5)) 

<  4iV|  max(6'(6*))^5  +  4(2iVi)^  max  \F\6,  Ci{6))\ 

9  9 

+  2(2Af,)"max|F(»,6r^,(«))| 

+  4Af|E[(max(f:,(»,  x))VFi(e,  i))/B*(a|l^. 

Substituting  the  upper  bounds  for  Rij,j  =  1,2,3,  into  (37),  also  noting  the  assumptions 
(C2)-(C4),  we  see  that  Ri{9,  5)  is  uniformly  bounded  with  respect  to  S.  Therefore,  J2i  Ri{9,  S) 
converges  uniformly  in  (0,  (5o)  for  any  (5o  >  0.  By  the  Weierstrass  M-test  [Krantz  (1991), 211], 
we  know  that  the  limit  and  the  summation  commute.  From  (37), 

(41)  lmjT|(L(X(«  +  i,0)-L(A'(»-i,0))2]  =  limyB.(«,i)  =  yiimfl.(»,i). 

i  i 

We  next  calculate  lim5_,.o  i?i(6*,  5).  For  each  i,  there  exists  a  <5*  >  0  such  that  for  any  S  <  Si 

\F{9  +  6,  b-{9  +  6)))  -  F{9  -  6,  b-{9  -  5)))| 

<t  ,  min 

4  J=^— 1,2,2  +  1,2+2 

Note  that  Di  i  =  Ei{9  —  5)nSj_i(6'  +  5)  and  =  Ei{9  —  (5)nSj+i(6'  +  5)  when  6  <  Si. 
Therefore,  by  taking  into  account  that  each  of  Di  i  and  71*^3  may  be  empty,  we  have 

I.F(e+s,b-i9+s)) 

7?,i  <  I  ^  ^  (L{h{9  +  S))-L{F-\{9-S,0)fd^ 

JF{9-5,b-{0-S)) 

<  max \F'i9,  6-(0))|(L(6,(0  +  5))  -  L(F;-_\(0  -  5,0))^ 

tf 

=  0(1) 
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where  f  G  [F{6  —  6,bi  {6  —  S)),F{6+S,bi  (6*+(5))).  Similarly,  Ri  s  =  o(l).  Hence,  lim^-^o-Djj  = 
0  for  j  =  1,3.  Also,  the  analysis  of  (39)  and  (40)  shows  that 

(42)  =  2{L{c0))  -  L{b0))Y\F'{e,c0))\ 

5— >-0 

Substituting  (42)  into  (41)  we  get 

lim  tB[(L(X(9  +  ^,0)  -  i(A'(9  -  ^,  {)))"]  =  M,(9) 

<5-5>0  d 

which  is  exactly  what  we  want  to  prove.  I 

The  proof  of  Lemma  3  shows  that  Assumptions  (C2)  and  (C3)  guarantee  that  the  inverse 
function  F~^{9,^)  is  sufficiently  smooth.  Assumption  (C4)  ensures  the  existence  of  Mi{9). 
Assumptions  (C2)-(C4)  are  mild.  Assumption  (Cl)  guarantees  the  smoothness  of  the  func¬ 
tion  L{X).  The  boundedness  of  L{X)  and  L'x{X)  can  be  removed  if  there  are  only  a  hnite 
number  of  sets  of  Bi{9).  The  hniteness  of  Bi{9)  can  also  relax  the  assumptions  (C2)-(C4). 

The  case  of  Mi{9)  =  0  can  only  occur  when  either  bi{9)  =  Ci{9)  or  F'{9,Ci{9))  =  0.  The 
situation  of  bi{9)  =  Ci{9)  (assuming  that  L{X)  is  not  a  constant)  happens  when  F{9,x)  is 
strictly  increasing.  A  repetition  of  the  proof  of  Lemma  3  yields  that 

Corollary  5.  If  Assumption  (Cl)  is  satisfied  and 

(43)  i))yF'(»,  i)]  <oo, 

then  F|(L(A'(«  +  S,0)  -  HX(B  -  5,0))"]  =  0(5"). 

Corollary  5  recovers  a  result  obtained  by  Glasserman  and  Yao  (1992)  under  the  assump¬ 
tion  of  Lipschitz  continuity  of  L{F~^{9,f)).  When  F'{9,Ci{9))  =  0  for  all  i,  using  the  same 
arguments  as  that  of  Corollary  5  we  can  establish  that 

Corollary  6.  In  addition  to  Assumptions  (C1)-(C4),  assume  that  F{9,Ci{9))  is  con¬ 
tinuously  twice  differentiable  for  all  i  with 

(44)  0  <  -  L{bfi9))fi\F%9,cfi9))\  <  oo. 

i 

Then,  E[{L{X{9  +  5, 0)  -  L{X{9  -  5, 0))']  =  0{5^) . 

The  following  Theorem  4  is  the  main  conclusion  of  this  subsection. 
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Theorem  4.  Assume  that  Assumptions  (Al)-(A4)  and  (C1)-(C4)  are  satisfied.  If 
Mi{6)  >  0  for  all  9,  then  the  best  convergenee  rate  for  the  KW  algorithm  (2)  with  hn  defined 
by  (33)  is  .  This  rate  is  attained  by  choosing  a„  =  an~^,  a  >  2/{5Ki),  and  5n  =  . 

Proof.  Under  Assumption  of  (C1)-(C4)  and  MfiO)  >  0,  we  know  from  Lemma  3  that 
i?[(L(X(6'  +  (5, .^))  —  L(X(6'  —  (5, .^)))^]  =  Mi{9)5  +  o(5).  According  to  (34),  the  variance  of 
hn  is  of  order  Uar[h„|6'„]  =  Mi(6*„)/5„  +  o(l/(5„).  Lemma  3  shows  that  Mi{6)  is  bounded. 
Therefore,  7  =  — 1  in  (8).  Since  (Al)-(A4)  are  satisfied.  Corollary  2  shows  that  the  best 
convergence  rate  is  I 

The  following  Theorem  5  summerizes  the  rate  of  convergence  of  the  KW  algorithm  (2) 
when  (3)  is  replaced  with  one-sided  hnite  difference  approximation  with  CRN.  The  proofs 
are  omitted  since  they  are  very  similar  to  that  of  Theorem  4. 

Theorem  5.  (I)  Under  the  same  conditions  as  those  of  Theorem  4  but  the  estimate  h 
is  replaced  by  the  following  one-sided  finite  difference  with  the  use  of  CRN 

,45)  5  _  nx(B  +  i,0)-nx(,e,0) ^ 

6  ’ 

the  best  convergence  rate  is  which  is  achieved  by  setting  On  =  an~^,a  >  l/3Ki,  and 

Sn  =  dn-^/fi 

(II)  Assume  all  the  assumptions  of  Theorem  4  except  that  MfiO)  =  0.  Then  Corollaries  5 
and  6  show  that  E[{L{X{9  -|-  5,  .^))  —  L(X(6*  —  <5,  .^)))^]  =  0(5^)  if  either  of  (43)  or  (44)  holds. 
Hence,  Var[hn\0n\  =0(1)  forhn  defined  by  either  (33)  or  (45).  The  best  convergence  rate  for 
the  KW  algorithm  (2)  is  5g  attained  by  setting  a„  =  an~^,a  >  l/2Ki, 

and  dn  =  dn~^,'r]  >  1/2. 

We  would  like  to  emphasize  that  the  assumptions  in  Corollaries  5  and  6  are  satisfied 
for  a  broad  class  of  stochastic  optimization  problems  [see  Glasserman  and  Yao  (1992)  for 
a  discussion].  Theorems  4  and  5  state  that,  when  the  inversion  method  is  used  in  the 
generation  of  random  variables  and  when  h  is  dehned  by  (33),  the  convergence  rate  for  the 
KW  algorithm  with  CRN  is  general  and  is  for  a  large  class  of  problems  that 

satisfy  the  assumptions  in  Corollaries  5  and  6.  The  improvement  is  signhcant  since  the  best 
possible  rate  for  the  same  KW  algorithm  without  CRN  is 

4.2.  Rejection  method.  Let  f{6,x)  be  the  density  function  of  X{6,f).  Assume  that,  for 
all  6*  G  0,  f{0,x)  is  zero  outside  a  hnite  interval  [a,  6]  and  is  bounded  by  0  <  f{0,x)  <  c. 
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c  >  0  is  a  constant.  The  rejection  method  generates  X{6,  according  to  the  following  three 
steps: 

1.  Generate  uniformly  distributed  on  [a,b]. 

2.  Generate  ^2  uniformly  distributed  on  [0,c]. 

3.  If  ^2  <  then  set  X{6,^)  =  .^1;  otherwise  go  to  1. 

The  rejection  method  uses  at  least  two  random  numbers  .^1  and  .^2  to  generate  X{d,^).  The 
total  number  of  random  numbers  C2  required  before  outputing  X{9,^)  is  a  random  value. 
The  rejection  method  does  not  accurately  meet  the  GRN  requirements  since  it  is  impossible 
to  define  X{6  +  S,  and  X{d  —  S,^)  using  a  fixed  set  of  uniform  random  numbers  [Bratley  et 
al.  (1983);  Franta  (1975)].  Therefore,  we  modify  the  dehnition  of  GRN  in  the  sense  dehned 
by  the  following  procedure  for  the  generation  of  a  paired  random  variables: 

Generation  of  X  {6  +  and  X{6  — 

1.  Generate  .^1  uniformly  distributed  on  [a,b]. 

2.  Generate  (,2  uniformly  distributed  on  [0,c]. 

3.  If  6  <  f{0  -  6, 6)  and  6  <  /(^  +  5,  ^i),  then  set  X{e  -6,0=  X{e  +  5,0=0- 

4.  If  0  <  f{9  —  6,0)  and  .^2  >  f{9  +  6,0):  then  set  X {6  —  6,0  =  0  and  generate  a 
X(6*  +  (5,  ^)  =  ,^3  by  the  rejection  method. 

5.  If  0  >  /(^  ~  and  0  0  f{9  +  5,  .^i),  then  set  X{6  +  5,0  =  0  and  generate  a 
X(0  —  5, 0  =  0  by  the  rejection  method. 

6.  If  0  >  -  6, 0)  and  0  >  +  6, 0),  go  to  1. 

This  is  essentially  a  coupling  procedure  [see  Devroye  (1990)  for  a  discussion  on  coupling]. 
Such  a  modihcation  is  necessary  to  mimic  the  scheme  of  GRN  using  the  rejection  method. 
We  will  soon  see  that  even  such  a  loosely  dehned  scheme  can  accelerate  the  convergence  of 
the  KW  algorithm.  Let  X(6' —  5, 0, d7(6*  +  5, 0  be  generated  by  the  preceding  procedure.  It 
is  obvious  that  E[h]  for  h  in  (33)  remains  the  same  as  that  in  the  inversion  method. 

Theorem  6.  Suppose  that  f{6,x)  is  zero  outside  [a,b],  0  <  f{0,x)  <  c  for  all  x  G  [a,b], 
and  X {9  —  6, 0 :  X {9  +  6, 0  ore  generated  by  the  previously  described  procedure.  Assume  that 
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(HI).  V ar[L{X [9 0)]  is  continuous  in  9  e  0; 


(H2).  f{9,x)  is  differentiable  in  9  for  each  x  G  [a,  6],  f{9,x)  satisfies  the  Lipschitz  condition 
with  respect  to  9,  i.e.,  there  is  a  K{x)  such  that  \f{9  +  (5,a;)  —  f{9,x)\  <  K{x)6,  and 
that  J^K{x)dx  <  oo. 


Define 


M2{9)  = 


Var[L{X{9,m 


\fQ{9,x)\dx. 


2c{b  —  a) 

Then  0  <  M2{9)  <  oo.  If  M2{9)  >  0  for  all  9,  Var[L{X{9,f,))]  is  bounded,  h  is  defined 
by  (33),  and  the  assumptions  (A1)-(A4)  are  satisfied,  then  the  convergence  rate  for  the  KW 
algorithm  with  CRN  is 

Proof.  We  see  from  the  procedure  of  generating  X{9  —  6,^)  and  X(9  +  6,^)  that, 
conditioned  on  either  ^2  <  f{d  —  or  ,^2  <  f{d  +  5, Ci);  H{9  +  <5, .C)  =  X{9  —  5,0  =  0 
when  0  <  f{d  —  5, 0)  and  0  <  /(^  +  5, 0);  otherwise  X(6'  +  5, 0  =0  and  X(6*  —  5, 0  =  0- 
Note  that  0  and  0  are  independent.  Therefore, 

(46)  Varlh]  =  )^^(Var[L((,)]  +  r<jr|L({4)l) T||/(9  +  -  f{0  -  6,(,)\] 

Under  Assumption  (HI),  Var[L{^s)]  +  Var[L{^4)]  =  2Uar[L(X(6',0)]  +  o(l).  By  Assumption 

(H2), 

U[|/(9  +  4,6)  -/(«- .5,6)11  <  W  f'‘K(x)dx 

0  b  —  a  Ja 

and  K{x)  is  integrable  on  [a,  6].  According  to  the  Weierstrass  M-test,  (46)  implies  that 
Var[h]  = 


^rar[L(A'(60)|U[l/(«  +  4.6)  -  /(»  -  4.6)11  +  o(f: 


A6(9)  1 

=  — +  ‘’0- 


\fg{0,x)\dx  +  oA 

2  0 


Thus,  we  know  from  Corollary  2  where  7  =  — 1  that  the  conclusion  follows.  I 


For  simplicity,  we  only  consider  the  simplest  form  of  the  rejection  method  and  the  case 
in  which  f{9,  x)  is  continuous.  An  analysis  similar  to  the  one  used  in  the  proof  of  Theorem  6 
shows  that  Var[h]  =  0(1/5)  remains  valid  in  the  following  three  situations:  (i)  The  estimate 
h  is  replaced  by  the  one-sided  hnite  difference  (45);  (ii)  The  density  fnnction  f{9,x)  is 
piecewise  differentiable;  (hi)  The  rejection  method  is  replaced  by  the  following  generalized 
rejection  method.  Assnme  that  there  exist  a  positive  constant  A  and  a  density  fnnction  g{x) 
such  that  f{9,x)  <  Ag{x)  for  all  9  and  for  all  x  G  [a,  b].  Then 
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1.  generate  with  the  density  function  g{x)-^ 

2.  generate  ^2  uniformly  distributed  on  [0, 745f(,^i)]; 

3.  if  ^2  <  then  set  X{6,^)  =  otherwise  go  to  1. 

It  is  easy  to  verify  that  X{9,^)  has  the  desired  distribution.  The  density  function  g{x) 
should  be  chosen  such  that  it  is  easier  to  generate  a  random  variable  with  g{x)  than  those 
with  f{e,x). 

Generally  speaking,  the  convergence  rates  for  the  KW  algorithm  are  the  same  when  either 
the  inversion  method  or  the  rejection  method  is  used  in  the  generation  of  the  random  variable 
X{6,^).  However,  the  rate  corresponding  to  the  use  of  the  rejection  method  is  universally 
true  for  any  function:  It  can  be  seen  from  its  dehnition  that  M2{0)  is  always  positive  except 
when  Var[L{X)]  =  0  or  when  f{0,x)  is  independent  of  6.  Both  cases  are  of  little  practical 
relevance.  Furthermore,  assume  that  the  assumptions  in  Theorem  6  are  satisfied  and,  in 
addition,  f{0,x)  is  strictly  positive  on  {a,b).  Then  the  best  possible  convergence  rate  for 
the  KW  algorithm  is  when  the  rejection  method  is  used  in  generating  X{9,  ^).  On  the 
other  hand,  the  distribution  function  F{9,x)  =  f  {9,  u)du  is  continuously  differentiable 
and  strictly  increasing  on  [a,  b] .  Theorem  5  shows  that  the  convergence  rate  for  the  KW 
algorithm  is  if  the  inversion  method  is  used  in  generating  X{9,^).  Therefore,  as  far 

as  the  convergence  of  the  KW  algorithm  is  concerned,  the  inversion  method  leads  to  faster 
convergence  than  the  rejection  method.  This  conclusion  is  in  favor  of  the  argument  that  the 
inversion  method  is  superior  to  the  rejection  method  [c.f.  Bratley  et  ah  (1983),  141]. 

4-3.  Composition  method.  Assume  that  the  distribution  function  F{9,x)  of  X{9,^)  is 
of  the  form 

m 

F{9,x)  =  Y.P0)F,{9,x) 

i=l 

where  Pi{9)  >  0,m  <  00,  =  1;  and  for  each  i,  Fi{9,  x)  is  a  distribution  function.  The 

composition  method  generates  the  random  variable  X{9,^)  in  the  following  way: 

1.  Generate  a  random  variable  Y  with  distribution  Prob{Y  =  i\  =  Pi{9). 

2.  If  K  =  f,  generate  X{9,^)  according  to  distribution  Fi{9,x). 

In  the  composition  method,  there  is  no  specihcation  on  the  method  for  the  generation 
of  random  variables  at  each  step.  Any  method  such  as  inversion  and  rejection  can  be  used. 
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As  an  example,  we  consider  the  case  in  which  random  variables  are  generated  using  the 
inversion  method  which  is  superior  to  the  rejection  method,  as  we  have  argued  in  the  previous 
subsection.  Dehne  po{0)  =  0,pi{9)  =  for  *  >  1-  The  following  procedure  is  the 

actual  composition  method  we  are  considering. 

1.  Generate  a  random  number  ,^i  uniformly  distributed  on  [0, 1). 

2.  If  ,^1  G  [pi-i{9) ,  pi{9)) ,  then  generate  a  random  number  ^2  uniform  on  [0,1)  and  set 

X{9,0  =  F-\9,^2). 

In  this  algorithm,  we  need  two  uniform  random  numbers  in  the  generation  of  X{9,^). 
Actually  we  can  do  with  only  one  random  number  by  setting  ^2  =  (■Ci  ~  / Pi{^) ■ 

Direct  verihcation  shows  that,  conditional  on  ,^1  G  [pi-i{9) ,  pi{9)) ,  ^2  is  uniform  on  [0,1). 
In  the  composition  method,  we  regard  that  X{9  —  S,^)  and  X{9  +  S,^)  conform  the  CRN 
requirement  if  they  are  generated  by  the  preceding  procedure  using  the  same  ^  =  {^1,^2)- 
We  can  prove  that  it  can  accelerate  the  convergence  of  the  KW  algorithm. 

For  simplicity,  we  assume  that,  for  each  i,  Fi{9,x)  =  Fi{x)  is  independent  of  9,  and  the 
number  of  distribution  component  is  hnite,  i.e.,  m  <  00.  The  case  in  which  F{9,x)  is  of 
general  form  can  be  treated  parallel  to  the  proof  of  Theorem  4.  Our  aim  here  is  to  hnd  special 
features  of  the  decomposition  method  rather  than  to  develop  the  complete  theory  which  is 
not  difficult  to  derive.  We  hrst  consider  the  situation  where  .^1  and  .^2  are  independent.  We 
then  examine  the  case  where  ^2  =  (■Ci  ~  / Pi{^) ■ 

Theorem  7.  Suppose  that  .^1  and  ^2  o^re  independent ,  Pi{9)  is  differentiable  in  9,  and 

m 

Mm  =  Y.E\(L{Fpm)  - 

i=l 

exits  and  is  finite.  If  Mfi9)  >  0  is  bounded  from  above  for  all  9,  h  is  defined  by  (33),  and 
the  assumptions  (A1)-(A4)  are  satisfied,  then  the  convergence  rate  for  the  KW  algorithm  is 

Proof.  The  proof  is  a  simplihed  version  of  that  of  Lemma  3  since  the  assumptions  here 
are  stronger.  According  to  the  generation  scheme  of  X [9  +  S,  (,)  and  X{9  —  we  have 

(47)  E[(L(X(e  +  6,0)-  L(X(e  -  6,  {)))"] 

”7  rPi(S-S) 

=  L  /  ,  ,  Eanxie  +  6,0)  -  L{X{0  -  6,im]dO 

i=i  dpt-iie-o) 
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Note  that  m  is  finite  and,  for  each  i,  Pi{9)  is  continuous.  There  exists  a  (5o  such  that  when 
5  <  (5o 

(48)  \pi{9  +  5)  -  pi{9  -  5)1  <  I  mm{pj+i{9  -  5)  -  pj{9  -  5)},  for  all  i. 

Let  us  first  consider  the  case  in  which  Pi-i{9  +  5)  >  Pi-i{9  —  5)  and  pi{9  +  5)  <  Pi{9  —  5). 
When  5  <  5o,  (48)  ensures  that  (47)  can  be  rewritten  as 


hf  rpi-i9^+^) 


E 


J Pi-i{e-5) 
^  rP^{0+S) 

rpi{6-5) 


Ei,[(L(x(e  +  i,0)  -  UX(e  - 


E^,\(L(X(»  +  i.o)  -  L{x(e  -  S,(,))f\<X.i 


E 


'J Pi{0+^) 


E^,[(L(X(e  +  6,£:))-L(X{e-6,£,))f]<iii 


=  E 


rpi-i{s+&) 


J pi-i{0-S) 
rpi(e-5) 


E 


i=\  Pi{0+^) 

m 


2=1 


T.EmFr^\(())  -  LiFpmV.mS  +  0(6) 


2  =  1 


By  considering  every  case  of  pi-i{9  +  S)  <  pi-i{9  —  S)  and  pi{9  +  6)  <  pi{9  —  S),  pi-i{9  +  6)  > 
Pi-i{9  —  6)  and  pi{9  +  5)  >  pi{9  —  5),  and  Pi-i{9  +  5)  <  Pi-i{9  —  5)  and  pi{9  +  5)  >  pi{9  —  5), 
we  obtain  that 


E[(L(x(e  +  6,0)  -  L(x(e  -  i,0))"l  =  iUe)s  +  o(<5). 

It  follows  from  (34)  that  Var[h]  =  (l/4)M3(6*)/5  +  o(l/5).  Applying  Corollary  2,  it  is  easy 
to  see  that  the  convergence  rate  for  the  KW  algorithm  is  .  We  have  thus  completed 
the  proof.  I 

Similarly,  we  can  prove  that  the  rate  remains  valid  for  the  case  in  which  the  random 
variable  X{9,^)  =  F~^{9,^2)  is  generated  by  setting  ^2  =  (■Ci  ~  Pi-i{9)) / Pi{9) . 

Theorem  8.  Suppose  that  pi{9)  >  0  is  differentiable  and  the  following  function  exists 
and  is  finite  for  all  9: 

m 

xm  =  i:[L(F,;\(i-))  -  i(F-'(0+))]"|p'(«)|, 

2=1 
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where  -^^+1(1  )  =  lim^'|-i  (,^)  and  ^(0-  U  >  0  is  bounded  for 

all  6  and  (Al)-(Af)  are  satisfied,  then  the  convergence  rate  for  the  KW  algorithm  is  . 


The  previous  Theorems  7  and  8  show  that  the  convergence  rate  for  the  KW  algorithm 
is  when  the  composition  method  is  used.  This  rate  does  not  depend  on  how  many 

random  numbers  are  used  in  the  generation  of  random  variables.  We  would  emphasize  that 
it  is  unlikely  for  each  of  Mfi6),  Mfid)  to  be  zero  in  practice. 


5.  The  MD  algorithm  with  CRN.  In  this  section,  we  examine  the  rates  of  con¬ 
vergence  for  the  MD  algorithm  under  CRN.  As  shown  in  the  previous  section,  the  use  of 
CRN  largely  affects  E[hff\  and  thus  the  reduction  of  the  variance  V ar[hfi\.  The  analysis  in 
the  previous  Section  4  provides  direct  information  on  Therefore,  in  this  section  we 

directly  work  without  going  through  V ar[hfi\.  We  may  represent  R'[/i^]  in  the  following 
form. 

(49)  E[hl]  <  c6l. 

Recall  that  7  =  — 2  for  independent  samplings  of  X{6,f)  without  CRN.  With  CRN,  7  =  — 1 
if  Mi{9)  >  0  and  7  =  0  if  MfiO)  =  0.  By  following  the  same  arguments  as  in  the  proof  of 
Theorem  3  and  applying  (49)  directly  for  Efif]  in  (27),  we  obtain  the  following  theorem. 


Theorem  9.  Assume  (B1)-(B3)  and  (49).  Then  we  have 

^  ^  n  ^  n 

(50)  E\J{K)  -  J(9*)l  <  7-  +  -  +  -  E'Sf. 


nOr. 


n 


n 


i=l  i=l 

where  C2  =  c/{2k),  Ci  and  C5  are  specified  in  Theorem  3. 

The  following  Corollary  7  summarizes  the  rates  of  convergence  for  the  MD  algorithm 
with  using  CRN  in  calculating  the  finite  difference  (33)  and  (45). 


Corollary  7.  Assume  (B1)-(B3)  and  (49).  Denote 

/O  n  n 


nar,  n 


i=l 


n 


2=1 


Then 


(i)  if''f  =  —1,  the  best  possible  rate  of  convergence  for  the  upper  bound  H{n)  is  n  when 
the  one-sided  finite  difference  (45)  is  used, 

(a)  if  '^  =  —1,  the  best  possible  rate  of  convergence  for  the  upper  bound  H{n)  is  when 
the  symmetric  finite  difference  (33)  is  used. 


(in)  z/7  =  0,  the  best  possible  rate  of  convergenee  for  the  upper  bound  H{n)  is  n  when 
either  the  one-sided  finite  differenee  (45)  or  the  symmetrie  finite  difference  (33)  is  used, 


6.  Generalization  and  applications.  In  Sections  4-5,  all  the  results  are  obtained 
for  one  dimensional  random  variables  only.  In  this  section,  we  extend  the  results  to  a  case 
of  multivariates,  which  is  not  difficult  but  very  tedious.  Assume  that  J{6)  =  E^[L{X {9,  (,))], 
where  the  multidimensional  random  variable  X{9,f)  =  [Xi{9,f),X2{0,f),...,Xm{9,f)]'^  G 
Pqj,  Xi{6,f)  =  Xi{6,f^i)  G  R,  fi  is  uniform  on  [0, 1).  We  only  consider  the  case 

in  which  each  Xi{9,  is  generated  from  using  the  inversion  method.  To  avoid  repetition, 
we  list  the  result  without  proof  which  is  very  similar  to  that  of  Theorem  5. 

Assume  that  J{6)  G  R  and  9  E  Q.  For  each  i,  1  <  i  <  m  <  cx),  let  Fi{9,x)  be  the 
distribution  function  of  Xi{9,fi)  with  the  decomposition  that 

dFi{9,x)  ^  f  0,  if  a;  G  B)j{9)  =  [hi,j{9) ,  Cij{9)] 

dx  \  fi,j{9,x),  if  a;  G  5+.(0)  =  lcij{9),bij+i{9)), 

where  [jj{B)j{9)  U  BF{9)}  =  R  for  all  i,  fij{9,  a;)  >  0  for  any  x  G  BR{9).  It  is  possible  that 
Fi{9,x)  is  discontinuous  at  bij{9). 

Theorem  10.  Assume  Assumptions  (A1)-(A4)  and,  in  addition. 


(Cl)’.  L{X)  is  continuously  differentiable  in  X,  L{X)  and  L'x.{X)  are  bounded  for  all  i; 


(C2)’.  for  each  i. 


^E[(max 


'dF,,,{9,x)Y  m,,{9,x) 


d9 


/- 


dx 


\ 


(C3)’.  for  all  i,j,  hjif))  is  continuously  differentiable  in  9,  and  max0(6'j(6*))^  <  ex; 


(C4)’.  for  all  i,j,  the  functions  Fi{9,Cij{9))  and  Fi{9,b^  j{9))  are  continuously  differentiable 
in  9,  and  max^  \F({9,  Cij{9))\  <  oo,  J2j  max^  \F({9,  b~j{9))\  <  oo, 

Define  Mi{9)  =  J2i,j{L{cij{9))  —  L{bij{9))y\F({9,Cij{9))\.  Then  Mi{9)  >  0  is  bounded.  If 
Mi{9)  >  0  for  all  9,  the  best  possible  convergence  rate  for  the  KW  algorithm  (2)  with  hn 
defined  by  (24)  is  n~‘^C _  attained  by  choosing  a„  =  an~^,  a  >  2/{5Ki),  and 


Similar  results  can  be  obtained  if  other  methods  are  used  in  the  generation  of  random 
variables  or  if  L(X)  is  a  piecewise  continuous  function  of  X.  The  analysis  can  be  applied  to 
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general  problems  such  as  Monte  Carlo  optimization  of  queueing  systems  and  other  general 
systems.  Although  such  a  generalization  is  not  trivial,  the  basic  idea  is  the  same  except  that 
the  analysis  becomes  tedious  and  lengthy.  Next  we  illustrate  an  application  of  Theorem  10 
to  the  optimization  of  queueing  systems  [see,  e.g.  Kleinrock  (1976)]. 

Example  1.  GI/G/1  queue  with  single  class  of  customers.  In  a  GI/G/1  queue, 
there  is  one  server  (such  as  a  teller  in  a  bank)  and  one  queue.  Upon  its  arrival,  a  customer 
enters  the  server  for  service  if  the  server  is  free,  otherwise  it  joins  the  queue  and  waits  for  its 
turn.  The  service  discipline  is  first-come-first-serve.  The  server  cannot  be  free  if  there  is  at 
least  one  customer  waiting  in  the  queue.  Assume  that  the  distribution  of  interarrival  times 
is  Gait)  and  the  distribution  of  service  times  is  Gs{9,t)  =  p{6)G\{t)  +  (1  —  p{6))G‘l{t) .  For 
simplicity,  we  assume  that  Gait),  Gl(t),  and  G^it)  are  independent  of  9  and  / t^dG{{t)  < 
+cx),j  =  1,2,  p{9)  is  continuously  differentiable  in  9,  Ga(t),Gi(t),j  =  1,2,  are  strictly 
increasing  and  continuously  differentiable  in  t.  In  queueing  theory,  the  system  time  of  a 
customer  is  dehned  as  the  time  period  from  its  arrival  till  departure.  Let  L{X{9,^))  be  the 
average  system  time  of  the  hrst  N  customers 

1  N 

i(A'(e,0)  =  ^Er.(«.0. 

i=l 

where  Ti{9,^)  is  the  system  time  of  the  Ah  customer.  Then  J{9)  =  E[L{X{9,^))]  is  the 
mean  system  time  of  the  first  N  customers.  We  want  to  find  the  optimal  parameter  9*  to 
minimize  J{9).  It  is  known  that  the  analytical  form  of  J{9)  is  not  available  for  general  Ga{t), 
Gl(t),  and  Gl(t)  [e.g.  Kleinrock  (1976)].  So  we  hnd  9*  via  the  KW  algorithm.  Assume  that 
the  queue  is  initially  empty.  According  to  Lindley’s  equation  [e.g.  Kleinrock  (1976)]: 

(51)  T,(0,O=max{T,_i(0,O-A,O}  +  A„  To{9,0  =  0, 

where  A*  is  the  interarrival  time  between  the  {i  —  l)th  and  the  Ah  customer.  Si  is  the  service 
time  of  the  Ah  customer.  The  distributions  of  A*  and  S'*  are  respectively  Gait)  and  Gs{9,t)- 
We  consider  two  scenarios. 

Case  1.  We  hnd  9*  through  computer  simulation.  We  write  a  program  to  simulate 
the  GI/G/1  queue.  At  the  nth  iteration,  we  perform  two  experiments  with  the  same 
to  obtain  a  hn  that  is  dehned  by  (33).  Consider  that  the  inversion  method  is  used  in 
the  generation  of  random  variables  Aj  =  G~^{ui),Si  =  G~^{9,Vi),i  =  1,2,.. .,N.  Dehne 
the  random  factor  as  .^  =  [ui,U2,  ...,U]\f,Vi,V2,  ...,vp^]'^ ,  A(,^)  =  [Ai,  A2, ...,  A^r],  S{9,^)  = 
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[S'!,  S'2, S'at],  and  X{9,^)  =  [74(,^),  S'(6*,  Since  the  fnnction  niax{a;,  0}  is  continnous 
in  x,  L{X)  is  continnons  in  X.  According  to  Theorem  10,  we  know  that  Mi{6)  =  0  since 
both  Ga{t)  and  Gs{9,t)  are  strictly  increasing  and  continnously  differentiable  in  t.  Note 
that  L{X)  is  not  differentiable  in  X.  However,  L{X)  is  left  and  right  differentiable  with 
bonnded  one-sided  derivatives.  A  simple  modihcation  of  the  proof  of  Corollary  3  shows 
that  Var[hn]  =  0(1).  Therefore,  the  convergence  rate  for  the  KW  algorithm  is  jf 

the  composition  method  is  used  in  the  generation  of  S{6,^)  according  to  the  distribution 
Gs{6,t),  then  from  Theorems  7  and  8  (which  is  applicale  to  the  case  of  multivariates)  we 
know  that  the  rate  of  convergence  is 

Case  2.  Assume  that  this  is  a  real  system  and  we  want  to  perform  on-line  parameter 
adjustment.  Let  visualize  n  as  the  nth  day  of  service.  Suppose  that  the  server  serves  more 
than  N  customers  each  day.  At  the  nth  day,  the  server  serves  customers  with  parameter 
value  9n  and  simultaneously  collects  information  of  X{9n,^n)  which  simply  is  a  record  of 
interarrival  times  {A”}  and  service  times  {(S'""}.  At  the  end  of  the  nth  day,  the  server 
calculates 

v,  =  G,{9^,S:),t  =  l,2,3,...,N. 

It  is  easy  to  verify  that  each  Vi  is  uniform  on  [0, 1).  Then  the  server  defines  from  the 
preceding  Vi,  i  =  1,2, N,  takes  a  >  0,  and 


S{9n  +  Sn,  Cn)  =  ...,  S^/],  Af’"  =  G;\9n  +  5n,  G,{9n,  3^)),  t  =  l,2,  ...,  iV; 

S{9n  -  5n,  Cn)  =  ^2 3^/],  =  G:\9r,  -  G',(0„,  3^)),  i  =  l,2,...,N. 

If  Gs{9,t)  =  1  —  is  exponential,  then  =  {9n  +  dn)3^/9n,  3'^'“^  =  {9n  —  5n)3'^/9n- 
With  the  values  of  A(,^„),  3{9n  +  6n,  Cn),  3{9n  +  Sn,  Cn),  from  (40)  and  the  form  of  L{X{9,^)), 
the  server  computes  L{X{9n  -|-  Sn,^n))  and  L{X{9n  —  Sn,^n)),  which  determines  a  With 
this  hn,  the  server  updates  the  parameter  9n+i  according  to  the  KW  algorithm  (2)  for  the 
next  (n  -|-  l)th  day.  In  such  a  way,  we  have  formulated  an  on-line  optimization  problem  that 
mimics  the  Monte  Carlo  optimization.  Its  convergence  can  be  analyzed  similarly  to  that 
of  Case  1.  Our  purpose  here  is  simply  to  point  out  that  the  results  of  this  paper  are  not 
restricted  to  Monte  Carlo  optimization. 


7.  Summary.  So  far,  we  have  examined  several  variations  of  the  KW  algorithm  and 
the  MD  algorithm  under  the  symmetric  finite  difference,  the  one-sided  hnite  difference,  and 
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the  use  of  CRN  when  different  methods  are  used  in  the  generation  of  random  variables.  The 
results  of  this  paper,  together  with  previous  results  on  the  KW  algorithm  without  the  use 
of  CRN  [c.f.  Fabian  (1971);  Kushner  and  Clark  (1978)],  provide  a  complete  view  toward  the 
rates  of  convergence  for  the  KW  algorithm.  For  the  ease  of  comparison,  we  summarize  all 
the  results  in  the  following  table. 


Table  I.  Rates  of  convergence  for  the  KW /MD  algorithm 


with  CRN 
h{33) 

with  CRN 
h(45) 

without  CRN 
fc(3) 

without  CRN 
h{23) 

inversion; 

Mi(0)^O 

n-215 

^-1/3 

n-V3 

n-iy 

inversion: 
Mi(0)  =  0 

n-i/2 

n-i/2 

n-V3 

n-iy 

rejection; 

general 

n-215 

n-ys 

n-V3 

n-iy 

composition: 

general 

n-2/5 

n-ys 

n-V3 

n-iy 

In  Table  I,  h{3),  h{23),  h{33),  and  h(45)  refer  to  the  hnite-difference  approximation  hn 
dehned  by  (3),  (23),  (33),  and  (45),  respectively.  The  phrase  “without  CRN”  refers  to 
using  independent  samples  in  calculating  the  hnite  difference  {hn},  which  excludes  sampling 
schemes  that  may  lead  to  correlations  between  the  samples.  In  other  words,  “without  CRN” 
simply  means  that  and  ^2,n  are  independent  in  (3)  and  (23).  When  the  inversion  method 
is  used  in  the  generation  of  random  variables  and  when  Mi{6)  =  0,  we  assume  that  Corollaries 
3  and  4  are  applicable.  Results  pertaining  to  the  KW  algorithm  without  the  use  of  CRN 
can  be  found  in,  for  example,  Fabian  (1971),  and  Kushner  and  Clark  (1978). 

Generally  speaking,  the  use  of  CRN  is  always  helpful  in  accelerating  the  convergence  of 
the  KW  algorithm  or  the  MD  algorithm.  In  some  cases,  such  as  when  Mi  (6)  =  0  in  Theorem 
5,  CRN  helps  a  lot.  In  some  of  other  cases,  CRN  may  help  less  much.  When  the  inversion 
method  is  used  and  when  Mi{6)  =  0,  the  convergence  rate  can  reach  the  best  possible 
rate  for  the  two  types  of  stochastic  approximation  algorithms.  The  remark  at  the  end  of 
Subsection  3.2  shows  that,  as  far  as  the  convergence  rate  of  the  KW  algorithm  is  concerned, 
the  inversion  method  is  superior  to  the  rejection  method.  Note  that  inversion  can  also  be  used 
to  generate  random  variables  with  distributions  of  the  form  J2iPi{h)Fi{6,x).  A  comparison 
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of  Theorem  4  and  Theorems  7  and  8  shows  that  inversion  is  also  superior  to  composition. 
When  the  distribution  function  F{6,x)  of  X{6,^)  is  strictly  increasing  and  continuous,  a 
close  examination  of  the  inversion,  rejection,  and  composition  methods  shows  that  X{6,^) 
is  continuous  in  6  if  it  is  generated  from  inversion.  However,  X{6,^)  is  discontinuous  in  6 
if  it  is  generated  from  either  rejection  or  composition.  It  is  such  a  distinction  of  continuity 
that  determines  the  rates  of  the  convergence  for  the  KW  algorithm. 
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